We can first focus on the features that we extracted as informative in the recipients, to see if, taken together, they would be more informative than when working on one data type alone.
Tolerant versus non tolerant recipients
Since looking at differences between the three groups consisting of primary, secondary and non tolerant patients at the same time wasn’t feasible, we divided our problem in two distinct sub-problems. We first focus on the features that seemed to play a role when trying to disinguish non tolerant from tolerant recipients.
We first need to gather the features that were extracted from the three data sources:
CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.
Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.
Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.
I extract informations on the recipients group, age, gender, … to be able to use this information later on.
## [1] "0"
## [1] "1"
## Id.Cryostem.R GROUP GENDER DOB DOG
## D1_TOL D2497 non_tolerant F 1962-07-06 <NA>
## R1_2_TOL R2498 non_tolerant F 1954-12-05 2014-08-06
## D2_NoTOL D284 non_tolerant F 1967-04-03 <NA>
## R2_NoTOL R385 non_tolerant M 1953-12-16 2013-05-02
## D3_TOL D1502 primary_tolerant F 1990-04-03 <NA>
## R3_1_TOL R1503 primary_tolerant M 1991-05-19 2014-01-02
## gender_comp age_recip cmv_comp
## D1_TOL FF 21794 01
## R1_2_TOL FF 21794 01
## D2_NoTOL FM 21687 NA1
## R2_NoTOL FM 21687 NA1
## D3_TOL FM 8264 11
## R3_1_TOL FM 8264 11
I add a prefix (“cyt_”, “rna_” or “met_”) in front of the features names, and I then merge all features together:
## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 23 patients in the St Louis cohort."
Correlation analysis
#pdf("~/Desktop/correlationplot_integration_recip_TNT.pdf", width = 16, height = 16)
gr <- plot_correlations(mat4volcano, compar = "TNT")
#dev.off()
Random forest modeling
To see if the features that we selected can help us to clssify the patients, we apply a 5 fold crosse validation setting on the 23 patients for which we have CyTOF, metabolomics and transcriptomics information.
set.seed(1)
cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
train_set <- mat4volcano[-cv_id[[idx]], c("group", colnames(ft_integ)[-1])]
test_set <- mat4volcano[cv_id[[idx]], c(colnames(ft_integ)[-1])]
true_labels <- mat4volcano[cv_id[[idx]], "group"]
rf <- ranger::ranger(group~., train_set, num.trees = 5000,
importance = "impurity")
pred <- predict(rf, test_set)
acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})
accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.88
Principal component analysis
We can first generate a PCA on the data:
The first principal component seems to hold much more information that the others 
We can investigate how much of the principal components are driven by the different data types.
The RNAseq data variability was captured much more than the other data types. This is most probably due to the fact that we selected 278 genes, against 42 metabolites and 24 CyTOF features only. 
We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 (and slightly PC8 and PC10) also look correlated with tolerance. 
We can then see how the PCs are correlated with the different informations that we have on the patients.
The first PC seems to be most correlated with group, cGVHD and ABO.incompatibility:

The second PC seems to be slightly correlated with the disorder group:

The third PC seems to be slightly correlated with aGVHD and ABO.incompatibility:

The fourth PC seems to be most correlated with group and age_recip:

Since the 1st and 4th PCs seem to be most correlated with tolerance, we visualise the patients in these two dimensions. We observe that these 2 components seem to separate the tolerant recipients (on the bottom-left) from the non tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to ABO incompatibility, recipients age and cGVHD. 
We can also map the cryostem patients on thiese PCs:
other <- readRDS("~/Documents/VIB/Projects/Integrative_Paris/Integrative/outputs/data/integ_national/pca_2plot.RDS")
other4pca <- other %>% dplyr::select(colnames(mat4pca))
rownames(other4pca) <- other$Id.Cryostem.R
new_coords <- scale(other4pca, pca$center, pca$scale) %*% pca$rotation
other_2plot <- other %>%
dplyr::select(Id.Cryostem.R, group, age_recip, cGVHD) %>%
mutate(PC1 = new_coords[,1],
PC4 = new_coords[,4])
all_2plot <- rbind(pca_2plot %>%
dplyr::select(Id.Cryostem.R, group, age_recip, cGVHD, PC1, PC4) %>%
mutate(cohort = rep("St_Louis", nrow(pca_2plot))),
other_2plot %>%
mutate(cohort = rep("Cryostem", nrow(other_2plot))))
p1 <- ggplot(all_2plot, aes(x= PC1, y= PC4, col= group, label = Id.Cryostem.R)) +
geom_point(aes(shape = cohort, size = cohort)) +
geom_text(aes(label=Id.Cryostem.R),hjust=-0.2, vjust=-0.2) +
ggtitle("Separation of tolerant vs non tolerant recipients")
# p2 <- ggplot(pca_2plot, aes(x= PC1, y= PC4, col= ABO.incompatibility, label = Id.Cryostem.R)) +
# scale_color_manual(values = c("chartreuse2", "black", "gray48")) +
# geom_point() +
# geom_text(aes(label=Id.Cryostem.R),hjust=0, vjust=0) +
# ggtitle("Separation of recipients based on ABO.incompatibility")
p3 <- ggplot(all_2plot, aes(x= PC1, y= PC4, col= age_recip, label = Id.Cryostem.R)) +
geom_point(aes(shape = cohort, size = cohort)) +
geom_text(aes(label=Id.Cryostem.R),hjust=-0.2, vjust=-0.2) +
ggtitle("Gradient in the recipients based on age")
p4 <- ggplot(all_2plot, aes(x= PC1, y= PC4, col= as.factor(cGVHD), label = Id.Cryostem.R)) +
scale_color_manual(values = c("black", "red")) +
geom_point(aes(shape = cohort, size = cohort)) +
geom_text(aes(label=Id.Cryostem.R),hjust=-0.2, vjust=-0.2) +
ggtitle("Separation of recipients based on cGVHD")
{p1 +p3 + p4}
## Warning: Using size for a discrete variable is not advised.
## Warning: Using size for a discrete variable is not advised.
## Warning: Using size for a discrete variable is not advised.

Features expressed in the first prinipal component
We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :
## feature overexp orig
## 1 ENSG00000228956 -0.08156929 rna
## 2 ENSG00000242861 -0.08037703 rna
## 3 CNKSR2 -0.08024199 rna
## 4 ABLIM1 -0.07877023 rna
## 5 SPON1 -0.07850535 rna
## 6 ENSG00000222078 -0.07788463 rna
## 7 HSF5 -0.07771093 rna
## 8 BACH2 -0.07761380 rna
## 9 AXIN2 -0.07732816 rna
## 10 LINC01215 -0.07698807 rna
## 11 C12orf42 -0.07669759 rna
## 12 ENSG00000280123 -0.07658899 rna
## 13 IFNG.AS1 -0.07635503 rna
## 14 KLHL3 -0.07623209 rna
## 15 DCBLD2 -0.07554311 rna
## 16 LACTB2.AS1 -0.07500851 rna
## 17 LINC01259 -0.07499767 rna
## 18 NELL2 -0.07492519 rna
## 19 KLF3.AS1 -0.07490429 rna
## 20 ENSG00000261685 -0.07409850 rna
## 21 androsterone.sulfate -0.07407656 met
## 22 RNU6.8 -0.07387305 rna
## 23 GPRASP1 -0.07363176 rna
## 24 BEND5 -0.07350428 rna
## 25 SBF2.AS1 -0.07322580 rna
## 26 dehydroisoandrosterone.sulfate..DHEA.S. -0.07316363 met
## 27 RETREG1 -0.07308135 rna
## 28 RIC3 -0.07245566 rna
## 29 LARGE1 -0.07216967 rna
## 30 MCC -0.07181769 rna
## 31 ZNF256 -0.07165713 rna
## 32 LOC100507053 -0.07163273 rna
## 33 STRBP -0.07159202 rna
## 34 FBLN5 -0.07143257 rna
## 35 ENSG00000218730 -0.07099955 rna
## 36 androstenediol..3beta.17beta..monosulfate..1. -0.07079649 met
## 37 PARM1 -0.07057739 rna
## 38 FCMR -0.07053659 rna
## 39 KCNH8 -0.07047437 rna
## 40 NPAS2 -0.07042575 rna
## 41 ENSG00000276097 -0.07030503 rna
## 42 BCL7A -0.07019059 rna
## 43 epiandrosterone.sulfate -0.06996150 met
## 44 GOLGA6L9 -0.06992679 rna
## 45 LOC105369827 -0.06980929 rna
## 46 PAIP2B -0.06944937 rna
## 47 ALDH5A1 -0.06913474 rna
## 48 SDK2 -0.06900392 rna
## 49 ENSG00000224610 -0.06894275 rna
## 50 SPAG16 -0.06859450 rna
## 51 ENSG00000211611 -0.06842521 rna
## 52 NT5E -0.06836499 rna
## 53 C19orf18 -0.06830055 rna
## 54 TCTN1 -0.06823324 rna
## 55 ZNF90 -0.06812791 rna
## 56 EPHA4 -0.06783398 rna
## 57 RASSF6 -0.06783038 rna
## 58 ENSG00000270659 -0.06782274 rna
## 59 ACSM3 -0.06746280 rna
## 60 COBLL1 -0.06735532 rna
## 61 USP44 -0.06733209 rna
## 62 ENSG00000218713 -0.06726899 rna
## 63 CLNK -0.06726424 rna
## 64 UBE2Q2P2 -0.06721565 rna
## 65 ABCB4 -0.06663164 rna
## 66 X1.stearoyl.2.arachidonoyl.GPC..18.0.20.4. -0.06638245 met
## 67 ZNF711 -0.06629134 rna
## 68 CNTNAP2 -0.06593364 rna
## 69 OCM -0.06558826 rna
## 70 TTC9 -0.06556499 rna
## 71 ENSG00000214797 -0.06547693 rna
## 72 PPFIBP1 -0.06519230 rna
## 73 B3GALT2 -0.06471202 rna
## 74 P2RY10 -0.06465634 rna
## 75 ENSG00000211679 -0.06459460 rna
## 76 MIR3679 -0.06459080 rna
## 77 SYNPO -0.06457483 rna
## 78 PLPP5 -0.06454993 rna
## 79 ZNF165 -0.06412746 rna
## 80 ZNF540 -0.06400003 rna
## 81 IL23R -0.06387896 rna
## 82 CHD7 -0.06376742 rna
## 83 AMIGO1 -0.06353382 rna
## 84 LAMP3 -0.06348932 rna
## 85 X23_DN..86.....CD8low..13...TSCM.like -0.06319538 cyt
## 86 ENSG00000225569 -0.06299126 rna
## 87 NRCAM -0.06280542 rna
## 88 OVGP1 -0.06278661 rna
## 89 CDNF -0.06259539 rna
## 90 ENSG00000235251 -0.06252194 rna
## 91 ENSG00000234902 -0.06234214 rna
## 92 ENSG00000211820 -0.06206949 rna
## 93 DAB1 -0.06203348 rna
## 94 ENSG00000254777 -0.06168796 rna
## 95 ENSG00000273062 -0.06150881 rna
## 96 MARCH3 -0.06145497 rna
## 97 ENSG00000144642 -0.06072733 rna
## 98 TTC28 -0.06070103 rna
## 99 ZC3H12B -0.06061452 rna
## 100 SDR42E1 -0.06050211 rna
## 101 PWAR5 -0.06018475 rna
We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :
## feature
## 1 SRGAP1
## 2 P2RY1
## 3 C1QC
## 4 FLVCR2
## 5 IL10
## 6 ADAMTS2
## 7 PCYT1B
## 8 TCN2
## 9 TRIM58
## 10 MAFB
## 11 MMP8
## 12 MEIS1
## 13 C1QB
## 14 C1QA
## 15 ACSL1
## 16 TNS1
## 17 LTBP1
## 18 SPARC
## 19 SH3TC2
## 20 ARHGAP6
## 21 SELP
## 22 ITGB3
## 23 LCN2
## 24 ABLIM3
## 25 X.CD38._10_T.population.CD8...DP.CD4..FoxP3...40.70...
## 26 MAOB
## 27 CTDSPL
## 28 TTC7B
## 29 LANCL3
## 30 MFSD2B
## 31 SH3PXD2B
## 32 PEAR1
## 33 X.CD38._6_CD8.T.cells.FoxP3..CXCR3..CCR7..FAS.
## 34 SDC3
## 35 X.CD38._27_CD4.Treg
## 36 X.CTLA4._24_CD4.TCM.Th17.like
## 37 X.CD38._20_NK.cells..66.....Conventional.DCs...some.moDCs..34..
## 38 CA1
## 39 FBP1
## 40 ANKRD22
## 41 TREML1
## 42 X.CD38._17_CD8...87...DP..10....TEMRA..70....TSCM..30....CXCR5..B.markers
## 43 X.CD24._2_CD8.TEM..47.....TEMRA..42.....TCM..4.....TSCM..3..
## 44 SERPING1
## overexp orig
## 1 0.07303734 rna
## 2 0.07161886 rna
## 3 0.07095143 rna
## 4 0.07060925 rna
## 5 0.07044322 rna
## 6 0.06943897 rna
## 7 0.06662012 rna
## 8 0.06660225 rna
## 9 0.06563776 rna
## 10 0.06543088 rna
## 11 0.06515072 rna
## 12 0.06409927 rna
## 13 0.06391818 rna
## 14 0.06356339 rna
## 15 0.06329629 rna
## 16 0.06310627 rna
## 17 0.06225234 rna
## 18 0.06124110 rna
## 19 0.06045283 rna
## 20 0.05993583 rna
## 21 0.05951066 rna
## 22 0.05942689 rna
## 23 0.05921813 rna
## 24 0.05900356 rna
## 25 0.05891123 cyt
## 26 0.05873980 rna
## 27 0.05810987 rna
## 28 0.05768060 rna
## 29 0.05682371 rna
## 30 0.05677874 rna
## 31 0.05457941 rna
## 32 0.05424961 rna
## 33 0.05412829 cyt
## 34 0.05408341 rna
## 35 0.05370935 cyt
## 36 0.05360698 cyt
## 37 0.05326823 cyt
## 38 0.05300907 rna
## 39 0.05230039 rna
## 40 0.05119924 rna
## 41 0.05093662 rna
## 42 0.05071422 cyt
## 43 0.05028871 cyt
## 44 0.05012938 rna
Features expressed in the fourth prinipal component
We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.6 threshold) :
## feature
## 1 leucine
## 2 beta.alanine
## 3 alanine
## 4 isoleucine
## 5 valine
## 6 histidine
## 7 asparagine
## 8 X4_CD8.Naives.CXCR3...60....CD8.naives..28...CD8.TSCM..16...CD8.TCM..7..
## 9 X3.methyl.2.oxovalerate
## 10 methionine
## 11 choline.phosphate
## 12 lysine
## 13 cysteine.glutathione.disulfide
## 14 X4.methyl.2.oxopentanoate
## 15 X1.arachidonoyl.GPE..20.4n6..
## 16 tryptophan
## 17 tyrosine
## overexp orig
## 1 -0.1546621 met
## 2 -0.1490560 met
## 3 -0.1373994 met
## 4 -0.1275363 met
## 5 -0.1272318 met
## 6 -0.1254381 met
## 7 -0.1249464 met
## 8 -0.1241974 cyt
## 9 -0.1233013 met
## 10 -0.1227031 met
## 11 -0.1218951 met
## 12 -0.1182365 met
## 13 -0.1174700 met
## 14 -0.1153245 met
## 15 -0.1106526 met
## 16 -0.1084344 met
## 17 -0.1019113 met
We then focus on the features that are overexpressed in the non tolerant recipients (= the features that are most expressed on the TOP of the PC4). Therefore, the higher the feature value, the more it is expressed in non tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.5 threshold) :
## feature
## 1 PDCD1LG2
## 2 GBP1P1
## 3 GBP5
## 4 LINC02528
## 5 methyl.4.hydroxybenzoate.sulfate
## 6 X.CD38._23_DN..86.....CD8low..13...TSCM.like
## 7 X.CD38._33_B.naives..100..
## 8 P2RY10
## 9 X.CD38._17_CD8...87...DP..10....TEMRA..70....TSCM..30....CXCR5..B.markers
## overexp orig
## 1 0.1438614 rna
## 2 0.1289141 rna
## 3 0.1167383 rna
## 4 0.1125767 rna
## 5 0.1090165 met
## 6 0.1061148 cyt
## 7 0.1052441 cyt
## 8 0.1004525 rna
## 9 0.1004288 cyt
We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## L2 Regularized Linear Support Vector Machines with Class Weights
##
## 23 samples
## 23 predictors
## 2 classes: 'non_tolerant', 'tolerant'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 19, 19, 18, 18, 18
## Resampling results across tuning parameters:
##
## cost Loss weight Accuracy Kappa
## 0.25 L1 1 0.82 0.6571429
## 0.25 L1 2 0.82 0.6571429
## 0.25 L1 3 0.82 0.6571429
## 0.25 L2 1 0.82 0.6571429
## 0.25 L2 2 0.87 0.7571429
## 0.25 L2 3 0.83 0.6662338
## 0.50 L1 1 0.82 0.6571429
## 0.50 L1 2 0.82 0.6571429
## 0.50 L1 3 0.82 0.6571429
## 0.50 L2 1 0.82 0.6571429
## 0.50 L2 2 0.87 0.7571429
## 0.50 L2 3 0.83 0.6662338
## 1.00 L1 1 0.82 0.6571429
## 1.00 L1 2 0.82 0.6571429
## 1.00 L1 3 0.82 0.6571429
## 1.00 L2 1 0.82 0.6571429
## 1.00 L2 2 0.87 0.7571429
## 1.00 L2 3 0.83 0.6662338
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L2 and weight = 2.
Compare the features found in the PC1 and PC4 to the MOFA features:
In MOFA, the factor one is most tolerance related, it consists mainly of CyTOF and Metabo features. In this MOFA factor, the recipients are nocely split into non tolerant (low loadings) to tolerant (high loadings on the first factor).
cyto_MOFA <- read.csv("../data/integ/MOFA_res/MOFA_cyto_4F_weights.csv")
metabo_MOFA <- read.csv("../data/integ/MOFA_res/MOFA_meta_4F_weights.csv")
rna_MOFA <- read.csv("../data/integ/MOFA_res/MOFA_RNA_4F_weights.csv")
MOFA_res <- rbind(cyto_MOFA, metabo_MOFA, rna_MOFA) %>%
mutate(orig = c(rep("cyt", nrow(cyto_MOFA)),
rep("met", nrow(metabo_MOFA)),
rep("rna", nrow(rna_MOFA))))
PC1 loadings: (low loadings: tolerant ++, high loadings: non tolerant)
# pc1_tol <- read.xlsx("../data/integ/tol_ft_PC1.xlsx")
overexp <- as.data.frame(pca$rotation[,c(1,4)]) %>%
rownames_to_column("X") %>%
magrittr::set_colnames(c("X", "PC1", "PC4"))
orig_split <- strsplit(overexp$X, "_")
orig <- map(orig_split, function(x){x[1]}) %>% unlist
overexp <- overexp %>%
mutate(orig = orig,
X = gsub("^[a-z]+_", "", X))
We can then compare the results of the PCA (focusing on the PCs 1 and 4, which were most associated with tolerance), to the 1st factor in MOFA (“LF1”): The concordance between the cyto and metabo features is quite good. The MOFA LF1 didn’t contain RNA features, which explains the uggly horizontal trend for this datatype.
both <- overexp %>%
left_join(MOFA_res, by = c("X", "orig")) %>%
mutate(PC1 = -PC1,
PC4 = -PC4)
## Warning: Column `X` joining character vector and factor, coercing into character
## vector
ggplot(both, aes(x = PC1, y = LF1, col = orig)) +
geom_point() +
scale_fill_manual(values = c("palegreen3", "mediumorchid4", "sandybrown")) +
ggtitle("Concordance between MOFA and PCA results")
## Warning: Removed 11 rows containing missing values (geom_point).

ggplot(both, aes(x = PC4, y = LF1, col = orig)) +
geom_point() +
scale_fill_manual(values = c("palegreen3", "mediumorchid4", "sandybrown")) +
ggtitle("Concordance between MOFA and PCA results")
## Warning: Removed 11 rows containing missing values (geom_point).

I can make a heatmap of the most important genes along the PC1:
pc1_rna_table <- overexp %>%
filter(orig == "rna") %>%
mutate(PC1 = abs(PC1)) %>%
arrange(desc(PC1))
pc1_rna_features <- pc1_rna_table$X[1:20]
rna_table <- ft_integ[,paste0("rna_", "", pc1_rna_features)]
rownames(rna_table) <- ft_integ$Id.Cryostem.R
colnames(rna_table) <- gsub("rna_", "", colnames(rna_table))
info_table <- info_complete_patients %>%
mutate(cGVHD = as.factor(cGVHD)) %>%
column_to_rownames("Id.Cryostem.R")
info_table <- info_table[rownames(rna_table),]
# arrange patients per PC1:
pca_table <- pca_2plot %>%
select(Id.Cryostem.R, PC1, PC4) %>%
arrange(PC1)
info_table <- info_table[pca_table$Id.Cryostem.R,]
rna_table <- rna_table[pca_table$Id.Cryostem.R,]
pheatmap::pheatmap(t(rna_table), cluster_cols = FALSE,
annotation_col = info_table[,c("group", "cGVHD", "ABO.incompatibility")],
annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue"),
ABO.incompatibility = c(Major = "red",
Minor = "orange",
Compatible = "blue")))

I can make a heatmap of the most important metabolites along the PC1:
pc1_met_table <- overexp %>%
filter(orig == "met") %>%
mutate(PC1 = abs(PC1)) %>%
arrange(desc(PC1))
pc1_met_features <- pc1_met_table$X[1:20]
met_table <- ft_integ[,paste0("met_", "", pc1_met_features)]
rownames(met_table) <- ft_integ$Id.Cryostem.R
colnames(met_table) <- gsub("met_", "", colnames(met_table))
info_table <- info_complete_patients %>%
mutate(cGVHD = as.factor(cGVHD)) %>%
column_to_rownames("Id.Cryostem.R")
info_table <- info_table[rownames(met_table),]
# arrange patients per PC1:
pca_table <- pca_2plot %>%
select(Id.Cryostem.R, PC1, PC4) %>%
arrange(PC1)
info_table <- info_table[pca_table$Id.Cryostem.R,]
met_table <- met_table[pca_table$Id.Cryostem.R,]
pheatmap::pheatmap(t(met_table), cluster_cols = FALSE,
annotation_col = info_table[,c("group", "cGVHD", "ABO.incompatibility")],
annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue"),
ABO.incompatibility = c(Major = "red",
Minor = "orange",
Compatible = "blue")))

I can make a heatmap of the most important CyTOF features along the PC1:
pc1_cyt_table <- overexp %>%
filter(orig == "cyt") %>%
mutate(PC1 = abs(PC1)) %>%
arrange(desc(PC1))
pc1_cyt_features <- pc1_cyt_table$X[1:20]
cyt_table <- ft_integ[,paste0("cyt_", "", pc1_cyt_features)]
rownames(cyt_table) <- ft_integ$Id.Cryostem.R
colnames(cyt_table) <- gsub("cyt_", "", colnames(cyt_table))
info_table <- info_complete_patients %>%
mutate(cGVHD = as.factor(cGVHD)) %>%
column_to_rownames("Id.Cryostem.R")
info_table <- info_table[rownames(cyt_table),]
# arrange patients per PC1:
pca_table <- pca_2plot %>%
select(Id.Cryostem.R, PC1, PC4) %>%
arrange(PC1)
info_table <- info_table[pca_table$Id.Cryostem.R,]
cyt_table <- cyt_table[pca_table$Id.Cryostem.R,]
pheatmap::pheatmap(t(cyt_table), cluster_cols = FALSE,
annotation_col = info_table[,c("group", "cGVHD", "ABO.incompatibility")],
annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue"),
ABO.incompatibility = c(Major = "red",
Minor = "orange",
Compatible = "blue")))

I can make a heatmap of the most important genes along the PC4:
pc4_rna_table <- overexp %>%
filter(orig == "rna") %>%
mutate(PC4 = abs(PC4)) %>%
arrange(desc(PC4))
pc4_rna_features <- pc4_rna_table$X[1:20]
rna_table <- ft_integ[,paste0("rna_", "", pc4_rna_features)]
rownames(rna_table) <- ft_integ$Id.Cryostem.R
colnames(rna_table) <- gsub("rna_", "", colnames(rna_table))
info_table <- info_complete_patients %>%
mutate(cGVHD = as.factor(cGVHD)) %>%
column_to_rownames("Id.Cryostem.R")
info_table <- info_table[rownames(rna_table),]
# arrange patients per PC1:
pca_table <- pca_2plot %>%
select(Id.Cryostem.R, PC1, PC4) %>%
arrange(PC4)
info_table <- info_table[pca_table$Id.Cryostem.R,]
rna_table <- rna_table[pca_table$Id.Cryostem.R,]
pheatmap::pheatmap(t(rna_table), cluster_cols = FALSE,
annotation_col = info_table[,c("group", "age_recip")],
annotation_colors = list(group = c(non_tolerant = "red", tolerant = "blue")))

Separate data type analyses
For comparison, we can see if we would have managed to separate the tolerant from non tolerant recipients as well without integrating the data:
CyTOF:
We perform PCA using only the 24 features that were selected as informative in both cohorts:
The first comoponent seems to be capturing most of the data variability: 
We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC6 is the 2nd mostly correlated with tolerance. 
The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA. 
We can also have a look at the features that are driving these two components: 
We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:
## L2 Regularized Linear Support Vector Machines with Class Weights
##
## 23 samples
## 10 predictors
## 2 classes: 'non_tolerant', 'tolerant'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 19, 19, 18, 18, 18
## Resampling results across tuning parameters:
##
## cost Loss weight Accuracy Kappa
## 0.25 L1 1 0.79 0.6142857
## 0.25 L1 2 0.79 0.6142857
## 0.25 L1 3 0.79 0.6142857
## 0.25 L2 1 0.79 0.6142857
## 0.25 L2 2 0.75 0.5233766
## 0.25 L2 3 0.71 0.4476190
## 0.50 L1 1 0.75 0.5233766
## 0.50 L1 2 0.75 0.5233766
## 0.50 L1 3 0.75 0.5233766
## 0.50 L2 1 0.75 0.5233766
## 0.50 L2 2 0.75 0.5233766
## 0.50 L2 3 0.75 0.5233766
## 1.00 L1 1 0.75 0.5233766
## 1.00 L1 2 0.75 0.5233766
## 1.00 L1 3 0.75 0.5233766
## 1.00 L2 1 0.71 0.4476190
## 1.00 L2 2 0.71 0.4476190
## 1.00 L2 3 0.75 0.5233766
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.
The accuracy is quite lower than in the integrative model.
To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the CyTOF features of the same 23 patients.
set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(cyt4pca),]
cyt4rf <- cbind(info_rf$group, cyt4pca)
colnames(cyt4rf)[1] <- "group"
cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
train_set <- cyt4rf[-cv_id[[idx]], ]
test_set <- cyt4rf[cv_id[[idx]], -1]
true_labels <- cyt4rf[cv_id[[idx]], "group"]
rf <- ranger::ranger(group~., train_set, num.trees = 5000,
importance = "impurity")
pred <- predict(rf, test_set)
acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})
accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.88
Metabolomics:
Variability captured by each component: 
We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC4 is the 2nd mostly correlated with tolerance. 
The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA. 

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:
## L2 Regularized Linear Support Vector Machines with Class Weights
##
## 23 samples
## 10 predictors
## 2 classes: 'non_tolerant', 'tolerant'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 19, 19, 18, 18, 18
## Resampling results across tuning parameters:
##
## cost Loss weight Accuracy Kappa
## 0.25 L1 1 0.83 0.6424242
## 0.25 L1 2 0.83 0.6424242
## 0.25 L1 3 0.83 0.6424242
## 0.25 L2 1 0.83 0.6424242
## 0.25 L2 2 0.79 0.5655012
## 0.25 L2 3 0.79 0.5655012
## 0.50 L1 1 0.83 0.6424242
## 0.50 L1 2 0.83 0.6424242
## 0.50 L1 3 0.83 0.6424242
## 0.50 L2 1 0.79 0.5655012
## 0.50 L2 2 0.79 0.5655012
## 0.50 L2 3 0.79 0.5655012
## 1.00 L1 1 0.79 0.5655012
## 1.00 L1 2 0.79 0.5655012
## 1.00 L1 3 0.79 0.5655012
## 1.00 L2 1 0.74 0.4655012
## 1.00 L2 2 0.74 0.4655012
## 1.00 L2 3 0.74 0.4655012
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.
Again, the accuracy of the model is slightly lower than the accuracy of the integrative model.
To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the metabolites of the same 23 patients.
set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(met4pca),]
met4rf <- cbind(info_rf$group, met4pca)
colnames(met4rf)[1] <- "group"
cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
train_set <- met4rf[-cv_id[[idx]], ]
test_set <- met4rf[cv_id[[idx]], -1]
true_labels <- met4rf[cv_id[[idx]], "group"]
rf <- ranger::ranger(group~., train_set, num.trees = 5000,
importance = "impurity")
pred <- predict(rf, test_set)
acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})
accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.8266667
Transcriptomics:
Variability captured by each component: 
We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC8 is the 2nd mostly correlated with tolerance. 
The non tolerant recipients are slightly less well separated from the tolerant recipients than in the integrative PCA. 

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:
## L2 Regularized Linear Support Vector Machines with Class Weights
##
## 23 samples
## 10 predictors
## 2 classes: 'non_tolerant', 'tolerant'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 19, 19, 18, 18, 18
## Resampling results across tuning parameters:
##
## cost Loss weight Accuracy Kappa
## 0.25 L1 1 0.53 0.03846154
## 0.25 L1 2 0.53 0.03846154
## 0.25 L1 3 0.53 0.03846154
## 0.25 L2 1 0.53 0.03846154
## 0.25 L2 2 0.53 0.03846154
## 0.25 L2 3 0.53 0.03846154
## 0.50 L1 1 0.53 0.03846154
## 0.50 L1 2 0.53 0.03846154
## 0.50 L1 3 0.53 0.03846154
## 0.50 L2 1 0.53 0.03846154
## 0.50 L2 2 0.53 0.03846154
## 0.50 L2 3 0.53 0.03846154
## 1.00 L1 1 0.53 0.03846154
## 1.00 L1 2 0.53 0.03846154
## 1.00 L1 3 0.53 0.03846154
## 1.00 L2 1 0.53 0.03846154
## 1.00 L2 2 0.53 0.03846154
## 1.00 L2 3 0.53 0.03846154
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.
It looks like, in our integrative approach, we managed to separate the recipients based on their tolerance better than when we look at the different data types separately.
To see if the features that we selected can help us to clssify the patients, we can also apply a 5 fold cross validation setting on the genes of the same 23 patients.
set.seed(1)
info_rf <- info_complete_patients %>% column_to_rownames("Id.Cryostem.R")
info_rf <- info_rf[rownames(rna4pca),]
rna4rf <- cbind(info_rf$group, rna4pca)
colnames(rna4rf)[1] <- "group"
cv_id <- list(c(1:5), c(6:10), c(11:15), c(16:20), c(21:23))
cv_accuracies <- lapply(seq_along(cv_id), function(idx){
train_set <- rna4rf[-cv_id[[idx]], ]
test_set <- rna4rf[cv_id[[idx]], -1]
true_labels <- rna4rf[cv_id[[idx]], "group"]
rf <- ranger::ranger(group~., train_set, num.trees = 5000,
importance = "impurity")
pred <- predict(rf, test_set)
acc <- ( length(which(pred$predictions == true_labels))/length(true_labels) )
})
accuracy <- mean(unlist(cv_accuracies))
accuracy
## [1] 0.84
Primary versus secondary tolerant recipients
We now focus on the features that seemed to play a role when trying to disinguish primary from secondary tolerant recipients.
We first need to gather the features that were extracted from the three data sources:
CyTOF features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.
Metabolomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.
Transcriptomics features that were selected in both the St Louis and the Cryostem cohort, with a threshold of 0.90. These features had the same behaviour in both cohorts.
I extract informations on the recipients group, age, gender, … to be able to use this information later on.
## [1] "0"
## [1] "1"
## Id.Cryostem.R GROUP GENDER DOB DOG
## D1_TOL D2497 non_tolerant F 1962-07-06 <NA>
## R1_2_TOL R2498 non_tolerant F 1954-12-05 2014-08-06
## D2_NoTOL D284 non_tolerant F 1967-04-03 <NA>
## R2_NoTOL R385 non_tolerant M 1953-12-16 2013-05-02
## D3_TOL D1502 primary_tolerant F 1990-04-03 <NA>
## R3_1_TOL R1503 primary_tolerant M 1991-05-19 2014-01-02
## gender_comp age_recip cmv_comp
## D1_TOL FF 21794 01
## R1_2_TOL FF 21794 01
## D2_NoTOL FM 21687 NA1
## R2_NoTOL FM 21687 NA1
## D3_TOL FM 8264 11
## R3_1_TOL FM 8264 11
I add a prefix (“cyt_”, “rna_” or “met_”) in front of the features names, and I then merge all features together:
## [1] "We have information on the CyTOF, metabolomics and transcriptomics data for 10 patients in the St Louis cohort."
Correlation analysis
Principal component analysis
We can first generate a PCA on the data:
The first principal component seems to hold much more information that the others 
We can investigate how much of the principal components are driven by the different data types.
The RNAseq data variability was captured much more than the other data types. This is most probably due to the fact that we selected 278 genes, against 42 metabolites and 24 CyTOF features only. 
We can see how these principal components are correlated with tolerance. The first PC is very correlated with the tolerance. The PC2 also looks correlated with tolerance. 
We can then see how the PCs are correlated with the different informations that we have on the patients.
The first PC seems to be most correlated with group, SAL and blood incompatibility:

The second PC seems to be most correlated with the source of graft, cGVHD and group:

The third PC seems to be correlated with cmv:

The fourth PC seems to be slightly correlated with the recipient’s age:

Since the 1st and 2nd PCs seem to be most correlated with tolerance, we visualise the patients in these two dimensions. We observe that these 2 components seem to separate the secondary tolerant recipients (on the bottom-left) from the primary tolerant recipients (on the top-right). We can also see how the recipients are displayed in these two dimensions according to recipients age and aGVHD. 



Features expressed in the first principal component
We can first look at the PC1, since it holds most of the data’s variability: We first focus on the features that are overexpressed in the primary tolerant recipients (= the features that are most expressed on the LEFT of the PC1). Therefore, the lower the feature value, the more it is expressed in primary tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC1 (under the -0.2 threshold) :
## feature overexp orig
## 1 CDC25A -0.25699203 rna
## 2 MZT2B -0.25058094 rna
## 3 PCLAF -0.24943522 rna
## 4 E2F8 -0.24781996 rna
## 5 TOP2A -0.24297857 rna
## 6 UBE2C -0.24196127 rna
## 7 MKI67 -0.24151215 rna
## 8 SNORD17 -0.23577399 rna
## 9 ASB2 -0.22842881 rna
## 10 FAM83D -0.22491879 rna
## 11 ENSG00000233597 -0.22387887 rna
## 12 MAPT -0.21923703 rna
## 13 ENSG00000210107 -0.21302546 rna
## 14 ENSG00000280981 -0.21246004 rna
## 15 CCDC157 -0.20362641 rna
## 16 KLRC2 -0.18556896 rna
## 17 ASAP3 -0.17480115 rna
## 18 AGMAT -0.17203260 rna
## 19 LOC101929574 -0.16686789 rna
## 20 TMEM53 -0.13343402 rna
## 21 X3.aminoisobutyrate -0.08116272 met
## 22 GPRC5D.AS1 -0.05982952 rna
We then focus on the features that are overexpressed in the secondary tolerant recipients (= the features that are most expressed on the RIGHT of the PC1). Therefore, the higher the feature value, the more it is expressed in secondary tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC1 (over the 0.05 threshold) :
## feature overexp orig
## 1 X.CD38._16_T.population.Naive.TSCM 0.12087507 cyt
## 2 ENSG00000256658 0.08235933 rna
## 3 PRUNE2 0.07847496 rna
## 4 X2.palmitoyl.GPC..16.0.. 0.06876445 met
Features expressed in the third prinipal component
We first focus on the features that are overexpressed in the secondary tolerant recipients (= the features that are most expressed at the BOTTOM of the PC2). Therefore, the lower the feature value, the more it is expressed in secondary tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the lowest features in the PC2 (under the -0.3 threshold) :
## feature overexp orig
## 1 X.CD38._29_DP.Treg -0.4574753 cyt
## 2 X.CD38._16_T.population.Naive.TSCM -0.3850877 cyt
## 3 HACD1 -0.3342307 rna
We then focus on the features that are overexpressed in the primary tolerant recipients (= the features that are most expressed at the TOP of the PC2). Therefore, the higher the feature value, the more it is expressed in primary tolerant recipients. 
We extract the “importance values” for these genes, metabolites and CyTOF features in an excel file. Here’s a sample of the highest features in the PC2 (over the 0.2 threshold) :
## feature overexp orig
## 1 X3.aminoisobutyrate 0.4032602 met
## 2 KLRC2 0.3071328 rna
We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:
## L2 Regularized Linear Support Vector Machines with Class Weights
##
## 10 samples
## 10 predictors
## 2 classes: 'primary_tolerant', 'secondary_tolerant'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 8, 8, 8, 8, 8
## Resampling results across tuning parameters:
##
## cost Loss weight Accuracy Kappa
## 0.25 L1 1 1.0 1.0
## 0.25 L1 2 1.0 1.0
## 0.25 L1 3 1.0 1.0
## 0.25 L2 1 1.0 1.0
## 0.25 L2 2 1.0 1.0
## 0.25 L2 3 1.0 1.0
## 0.50 L1 1 1.0 1.0
## 0.50 L1 2 1.0 1.0
## 0.50 L1 3 1.0 1.0
## 0.50 L2 1 1.0 1.0
## 0.50 L2 2 1.0 1.0
## 0.50 L2 3 1.0 1.0
## 1.00 L1 1 1.0 1.0
## 1.00 L1 2 1.0 1.0
## 1.00 L1 3 1.0 1.0
## 1.00 L2 1 0.9 0.8
## 1.00 L2 2 1.0 1.0
## 1.00 L2 3 1.0 1.0
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.
Separate data type analyses
For comparison, we can see if we would have managed to separate the tolerant from non tolerant recipients as well without integrating the data:
CyTOF:
We apply PCA only on the 24 CyTOF features:
Variability captured by each component: 
Since we have only two CyTOF features, the separation is not that good:
The secondary tolerant recipients are badly separated from the primary tolerant recipients compared to the integrative PCA. 

Transcriptomics:
Variability captured by each component: 
We can see how these principal components are correlated with tolerance. The fourth PC is very correlated with the tolerance. The PC1 is the 2nd mostly correlated with tolerance. 
The non tolerant recipients are slightly better separated from the tolerant recipients than in the integrative PCA. 

We can try to build a model that would predict a patients output (tolerant or not) based on the PCs:
## L2 Regularized Linear Support Vector Machines with Class Weights
##
## 10 samples
## 9 predictor
## 2 classes: 'primary_tolerant', 'secondary_tolerant'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 8, 8, 8, 8, 8
## Resampling results across tuning parameters:
##
## cost Loss weight Accuracy Kappa
## 0.25 L1 1 0.9 0.8
## 0.25 L1 2 0.9 0.8
## 0.25 L1 3 0.9 0.8
## 0.25 L2 1 0.9 0.8
## 0.25 L2 2 0.9 0.8
## 0.25 L2 3 0.9 0.8
## 0.50 L1 1 0.9 0.8
## 0.50 L1 2 0.9 0.8
## 0.50 L1 3 0.9 0.8
## 0.50 L2 1 0.9 0.8
## 0.50 L2 2 0.9 0.8
## 0.50 L2 3 0.9 0.8
## 1.00 L1 1 0.9 0.8
## 1.00 L1 2 0.9 0.8
## 1.00 L1 3 0.9 0.8
## 1.00 L2 1 0.9 0.8
## 1.00 L2 2 0.9 0.8
## 1.00 L2 3 0.9 0.8
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were cost = 0.25, Loss = L1 and weight = 1.